A novel Skin lesion prediction and classification technique: ViT‐GradCAM

Abstract

Background: Skin cancer is one of the most frequently occurring diseases in humans. Early detection and treatment are essential to reduce the malignancy of infections. Deep learning techniques are supplementary tools that assist clinical experts in detecting and localizing skin lesions. Vision transformer (ViT) based multiclass image classification provides fairly accurate detection and is gaining popularity because of its reliable multiclass prediction capability.

Materials and methods: In this research, we propose a new ViT and Gradient‐Weighted Class Activation Mapping (GradCAM) based architecture, named ViT‐GradCAM, for detecting and classifying skin lesions according to their spreading ratio over the lesion's surface area. The proposed system is trained and validated on the HAM 10000 dataset, covering seven types of skin lesion. The database comprises 10 015 dermatoscopic images of varied sizes. Data preprocessing and data augmentation techniques are applied to overcome class imbalance and improve the model's performance.

Result: The proposed algorithm is based on ViT models and classifies the dermatoscopic images into seven classes with an accuracy of 97.28%, a precision of 98.51%, a recall of 95.2%, and an F1 score of 94.6%. The proposed ViT‐GradCAM achieves better and more accurate detection and classification than other state‐of‐the‐art deep learning‐based skin lesion detection models. The output of ViT‐GradCAM is extensively visualized to highlight the actual pixels in the essential regions associated with skin‐specific pathologies.

Conclusion: This research proposes an alternative solution to the challenges of detecting and classifying skin lesions. It combines ViTs with GradCAM, which plays a significant role in detecting and classifying skin lesions accurately, rather than relying solely on conventional deep learning models.


INTRODUCTION
A skin lesion is an area of the skin with a different development pattern or texture from the surrounding skin. Skin lesions can be separated into two types: primary and secondary. Primary skin lesions are abnormal skin conditions that may be present at birth or develop during an individual's life. Secondary skin lesions result from mistreated or irritated primary lesions. Granulation tissue, papules, tumors, and nodules are examples of primary skin lesions; scales and ulcers are examples of secondary skin lesions. Malignant cancerous growths may sometimes be lethal. 1 Melanoma (MEL) is a severe kind of skin cancer. Malignant MEL is one of the many severe skin malignancies caused by aberrant growth of skin cells, and its incidence has risen sharply in the past few years. MEL originates in the melanocytes and extends to the top layers of the skin. The primary target regions of the body are those most exposed to sunlight: the face, neck, legs, and arms. The World Health Organization (WHO) reports that over 132 000 cases of MEL and more than three million cases of other types of skin cancer are identified globally each year. 2 Such a severe disorder requires highly effective techniques for predicting the lesion class. Medical pathologists can reduce disruptive noise and obtain outline data by accurately identifying the edge of the skin lesion. 3 Deep learning has improved the efficiency of image analysis when an enormous amount of labeled data is available. However, because precise label generation requires specialist knowledge, acquiring pixel-level annotations for dermoscopic images is frequently costly. Various weakly supervised and semi-supervised learning methods were recently put forth for categorization when little pixel-level labeled information is available. Such methods enable precise classification using unlabeled or sparsely labeled data. 4
Self-learning is a semi-supervised technique that trains a teacher model on labeled data and uses it to generate artificial labels (pseudo-labels) for unlabeled samples.
Moreover, complex learning problems that are difficult to handle with conventional rule-based techniques can now be resolved

Meanwhile, deep learning models lack visualization at the classification stage, and some essential data is lost in the fully connected network.
To overcome this issue, Gradient-Weighted Class Activation Mapping (GradCAM) is employed. The significant contributions of this proposal are listed as follows:
1. Primary data preprocessing and data augmentation are employed to enrich the spatial feature extraction.
2. Class imbalance is minimized by generating artificial data so that the proposed ViT-GradCAM technique learns better.
3. Data with labels are augmented for better prediction of the spreading rate of skin lesions based on the class.

0] In the subsequent example, we use a Kullback-Leibler divergence loss to push the consequent distribution of probabilities in line with one of the following levels, connecting the categorized heads of every successive level. The experimental results showed an accuracy of about 92.8% and a precision of about 91.53%, and the confusion matrix was generated to validate the proficiency of the proposed model. 4][15] However, it has certain drawbacks, detailed in Table 1. 14

Overview
The primary goal of this study is to predict the spreading rate of skin lesions and to classify each lesion into one of seven classes, namely Actinic Keratoses and intraepithelial carcinoma (AKIEC), Basal Cell Carcinoma (BCC), benign Keratosis lesions (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevi (NV), and vascular lesions (VASC). The overview of this research process is shown in Figure 1.

Dataset
Human Against Machine-10 000 images (HAM 10000 dataset), one

After resizing, the sharpness of the image is increased by removing the blurring effect. Here, the denoising filter removes the fading effect and the blurriness of the skin lesion images in the HAM 10000 dataset.
The Laplacian kernel operation is employed as the sharpening filter: a Laplacian kernel with a positive center value surrounded by negative values in a cross structure is applied to sharpen the data.
The sharpening Laplacian filter is expressed in Equation (1).
The sharpening Laplacian filter is well known for enhancing the edges and features of images while preserving local detail and computational efficiency. However, because it is sensitive to noise, it may introduce distortions. Furthermore, its sensitivity to minor fluctuations and possible fringe effects may affect the overall appearance of a skin lesion image. In medical imaging, its usefulness depends on how carefully it is applied, taking into account the amount of noise already present, to improve the sharpness of the images used for training the ViT-GradCAM model. After preprocessing, we proceed to data augmentation.
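The sharpening step described above can be illustrated with a small sketch. The exact kernel of Equation (1) is not reproduced in this text, so the 3 × 3 kernel below (center value 5, cross of −1) is an assumption based on the description "positive center value surrounded by negative values in a cross structure"; the function name `sharpen` and the edge-replication border handling are likewise illustrative choices.

```python
# Assumed 3x3 sharpening Laplacian kernel: positive centre, negative cross.
# The centre value 5 is an illustrative assumption, not taken from the paper.
SHARPEN_KERNEL = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def sharpen(image, kernel=SHARPEN_KERNEL):
    """Apply a 3x3 kernel to a 2-D grayscale image (list of rows, values 0-255).

    Border pixels are handled by replicating the nearest edge pixel, and the
    result is clipped to the valid intensity range.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)  # replicate top/bottom edge
                    cc = min(max(c + dc, 0), w - 1)  # replicate left/right edge
                    acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = min(max(acc, 0), 255)
    return out


# A vertical edge between dark (10) and bright (200) pixels is exaggerated.
step = [[10, 10, 200, 200] for _ in range(3)]
print(sharpen(step)[1])
```

Because the kernel weights sum to one, flat regions are left unchanged and only intensity transitions are amplified, which is also why noise in the input is amplified along with the lesion edges.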

Data augmentation
ViT models have been widely used in medical diagnostic applications because of their high accuracy. Augmenting the training images further improves the performance of such frameworks. The HAM 10000 dataset contains diverse skin lesion images but has a high class imbalance, as shown in Figure 3. Basic augmentation operations such as flipping (horizontal and vertical), shifting, rotation, and random zoom are performed to overcome the imbalance issue.
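As a minimal sketch of the basic augmentation operations listed above, the following pure-Python functions implement flips, a 90-degree rotation, and a horizontal shift on an image stored as a list of rows. The helper names (`hflip`, `vflip`, `rot90`, `shift_right`, `oversample`) and the random choice of operation are assumptions for illustration; random zoom is omitted for brevity, and the actual pipeline used in the study is not specified here.

```python
import random


def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Reverse the row order (vertical flip)."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def shift_right(img, k, fill=0):
    """Shift pixels k columns to the right, padding the left edge with `fill`."""
    return [[fill] * k + row[:len(row) - k] for row in img]


def oversample(images, target, seed=0):
    """Grow a minority class to `target` images with randomly chosen operations."""
    rng = random.Random(seed)
    ops = [hflip, vflip, rot90, lambda im: shift_right(im, 1)]
    out = list(images)
    while len(out) < target:
        out.append(rng.choice(ops)(rng.choice(images)))
    return out


img = [[1, 2], [3, 4]]
print(hflip(img), vflip(img), rot90(img), shift_right(img, 1))
```

Each operation returns a new image and leaves the original untouched, so the original samples remain in the augmented class alongside the synthetic ones.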

FIGURE 3 Data augmentation process of HAM 10000 dataset images.

Vision transformer mechanism
In general, ViT is a deep learning technique that uses a transformer to detect and classify images in various applications. In medical applications especially, accuracy is the main criterion for precisely detecting target regions in medical images.
To achieve this goal, the ViT model is employed in this proposal to predict the spreading rate of skin lesions over the patient's skin using the HAM

As mentioned above, the transformer encoder in the ViT captures long-range dependencies among the patches fed into it.
The segregated patches are called "tokens". The number of patches/tokens generated is given by the expression below:

$$N = \frac{h \times w}{P^{2}} \qquad (2)$$

where $N$ is the number of tokens generated from the image, $h$ and $w$ are the height and width of the actual image in pixels, and $P$ is the side length of each square patch.
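The token count can be checked numerically with the standard ViT relationship N = (h × w) / P², where P is the patch side length. The sketch below uses the 224 × 224 input size stated in this paper; the patch size of 16 is an assumption for illustration, since the patch size used by the authors is not given in this text, and `num_tokens` is a hypothetical helper name.

```python
def num_tokens(h, w, patch):
    """Number of non-overlapping patch x patch tokens in an h x w image."""
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (h // patch) * (w // patch)


# 224 x 224 input (from the paper) with an assumed patch size of 16.
print(num_tokens(224, 224, 16))  # 196
```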
Steps of ViT to classify an image:
1. An image is split into several patches.
2. The patches are flattened and compressed by a feed-forward layer called the linear patch projection.
3. The patches are converted into vectors of fixed size to produce the patch embeddings.
4. Tokens are added to the patch embeddings for better classification of skin lesions.
5. After the embeddings are concatenated with the tokens, the resulting fixed-size token vectors are fed to the transformer encoder.
6. The transformer encoder processes the token sequence to classify the patches into the corresponding classes.
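Steps 1 to 5 above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the function names, the tiny 4 × 4 image, and the hand-written projection matrix are all assumptions, and in a trained model the projection weights, class token, and learnable embedding would be learned parameters.

```python
def split_into_patches(image, p):
    """Step 1: cut an h x w image into flattened p x p patches (length p*p each).

    Assumes h and w are divisible by p.
    """
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(p) for j in range(p)]
        for r in range(0, h, p)
        for c in range(0, w, p)
    ]


def project(patch_vec, weight):
    """Steps 2-3: linear patch projection from p*p values to F values.

    `weight` has p*p rows and F columns.
    """
    f = len(weight[0])
    return [
        sum(patch_vec[i] * weight[i][j] for i in range(len(patch_vec)))
        for j in range(f)
    ]


def build_tokens(image, p, weight, class_token, learnable_embed):
    """Steps 4-5: prepend the class token and add the learnable embedding."""
    embedded = [project(v, weight) for v in split_into_patches(image, p)]
    sequence = [class_token] + embedded
    return [
        [x + e for x, e in zip(tok, emb)]
        for tok, emb in zip(sequence, learnable_embed)
    ]


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
weight = [[1, 0], [0, 1], [1, 0], [0, 1]]                  # p*p = 4 rows, F = 2
tokens = build_tokens(image, 2, weight, [0, 0], [[0, 0]] * 5)
print(len(tokens), tokens[1])
```

The resulting sequence has one more entry than the number of patches because of the class token, and every entry has the same fixed length F, which is what the transformer encoder expects.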
The overview of the ViT is shown in Figure 4. Each patch is a matrix of dimension $P \times P$, which is flattened into a vector of dimension $1 \times P^{2}$. The feed-forward layer compresses the patch into a compressed matrix $G$. The linear patch projection has dimension $P^{2} \times F$. The patches are converted into embedded patches $P_e$ with the fixed vector size $F$, so each patch embedding has dimension $1 \times F$. The tokens are added to the patch embeddings so that a single input is fed to the transformer encoder layer. Finally, the embedded patch $P_e$ is combined with the learnable token embedding $P_{el}$, which is treated as a matrix concatenation together with the class token $S_{class}$. The concatenated tokens are obtained using the expression

$$T_o = [S_{class};\; x_1 P_e;\; x_2 P_e;\; \dots;\; x_n P_e + P_{el}] \qquad (3)$$

The transformer encoder consists of a multi-headed self-attention (MSA) block and a multilayer perceptron (MLP) block. The function of the transformer encoder is mathematically represented as

The MSA block performs the self-attention (SA) mechanism, which is the key component of ViT. SA extracts the key information from the image fed to the transformer encoder. Using SA, a ViT model can focus on distinct areas of the input data according to their significance for the corresponding task. For this, the SA mechanism uses the concepts of query, key, and value. The dot product of the query and the key is divided by the square root of the key dimension, and a softmax function is applied to the result to obtain the attention weights.
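The attention computation described above (dot product of query and key, scaling by the square root of the key dimension, then softmax) is the standard scaled dot-product attention. The following single-head sketch in plain Python is illustrative only; the function names are hypothetical, and a real ViT would obtain the query, key, and value matrices from learned projections of the tokens and run several heads in parallel.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Single-head scaled dot-product attention.

    q, k, v are lists of token vectors; returns (outputs, attention weights).
    """
    d_k = len(k[0])
    outputs, weights = [], []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


out, attn = self_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [4.0, 0.0]])
print(out, attn)
```

When all keys are identical, as in the usage line, the weights are uniform and the output is simply the mean of the value vectors; distinct keys shift the weight toward the tokens most similar to the query.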
The concatenation of the weighted outputs from the SA block is passed to the feed-forward layer, which is expressed as

Then, the MLP block comprises two consecutive layers with a GELU activation function between them. The basic function of the MLP is to learn complex information from the data. The complex data extraction in the MLP is mathematically represented as

After the class is extracted, the output of the transformer encoder is passed to the final classification layer. In this research, based on the input data, we apply GradCAM for more accurate classification and spread-rate prediction of skin lesions.
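The MLP block described above (two consecutive layers with a GELU activation in between) can be sketched as follows. The exact GELU definition x · Φ(x) is used; the helper names and the identity weights in the usage example are assumptions for illustration, and the layer sizes used by the authors are not stated in this text.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer: weight has len(x) rows and len(bias) columns."""
    return [
        sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def mlp_block(x, w1, b1, w2, b2):
    """Two dense layers with a GELU activation between them."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
print(mlp_block([1.0, -1.0], identity, zeros, identity, zeros))
```

With identity weights the block reduces to applying GELU element-wise, which makes the smooth suppression of negative inputs easy to see.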

Gradient-Weighted Class Activation Mapping (Grad-CAM)
As in other deep learning networks, the neurons of the final layer carry the spatial information of the image needed for detection and classification. Unfortunately, some essential spatial information is lost in the fully connected (FC) layers of DL networks. To overcome this issue, Grad-CAM is employed in this research at the final layer of the ViT. Grad-CAM assigns importance values to every neuron in the final layer using gradient information. This section concentrates on the output layer to explain the function of the final layer.
For each class, the Grad-CAM map has width W and height H. The class-discriminative localization map is obtained by backpropagating the gradient for each class with respect to the activation map features $A^k$. The gradients, indexed by m over the width and n over the height, are used to obtain the weight of each essential neuron in the final layer for class c.
The averaged gradient values are used as weights in a weighted combination of the activation maps, followed by ReLU activation, to obtain the matrix given by expression (14). The corresponding matrix is combined with the actual image to represent the spatial features.
Here, we apply ReLU to the linear combination of the weighted maps because the spreading rate of the lesion over the skin must be predicted from the features that have a positive influence on the class: in the infected region, the pixel intensity $x^c$ is high compared with the average skin pixel intensity, whereas negative pixels represent normal skin. Without ReLU, the localization is less accurate and results in an inaccurate prediction rate.
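The Grad-CAM computation outlined in this section follows the standard formulation: the gradients of the class score are averaged over the spatial positions of each activation map to give one weight per map, and the ReLU of the weighted sum of maps gives the localization map. The sketch below assumes that the activation maps and their gradients are already available as nested lists. The `spread_ratio` helper is a hypothetical illustration of how a spreading measure could be derived from the map (the fraction of positions above a threshold); it is not the authors' definition.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM from K activation maps and their gradients.

    activations[k][m][n] is map k at position (m, n);
    gradients[k][m][n] is the gradient of the class score w.r.t. that value.
    Returns (localization map, per-map weights).
    """
    k_count = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # One weight per map: the spatial average of its gradients.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # ReLU keeps only positions with a positive influence on the class.
    cam = [
        [
            max(0.0, sum(alphas[k] * activations[k][m][n] for k in range(k_count)))
            for n in range(w)
        ]
        for m in range(h)
    ]
    return cam, alphas


def spread_ratio(cam, threshold=0.5):
    """Hypothetical spreading measure: share of positions >= threshold * peak."""
    peak = max(max(row) for row in cam)
    if peak == 0:
        return 0.0
    hot = sum(1 for row in cam for v in row if v >= threshold * peak)
    return hot / (len(cam) * len(cam[0]))


acts = [[[1, 0], [0, 2]], [[1, 1], [1, 1]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam, alphas = grad_cam(acts, grads)
print(alphas, cam, spread_ratio(cam))
```

In the usage example the second map receives a negative weight, so the positions where it dominates are zeroed by the ReLU and only the position that supports the class remains highlighted.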

Authors technique Evaluation metric outcomes
Aisuwaidan et al.

Comparative analysis
The proposed ViT-GradCAM model is compared with three benchmark models: CNN-based classification of dermatological disorders, 17 ViT-based skin lesion generation and classification, 13 and YoTransViT-based skin disease classification. 16 The comparative results show that the proposed ViT-GradCAM achieved a higher accuracy of about 97.2%, a precision of about 98.

CONCLUSION
The classification of skin lesions into seven different categories is challenging because of the high similarity among the categories. The primary identification method is visual, commencing with medical screening and progressing through dermoscopic analysis, histopathological evaluation, and specimen acquisition. Deep learning approaches perform highly discriminative and broadly generalizable tasks when applied to fine-grained object categorization. This work proposes a novel ViT technique that integrates the Gradient-

thanks to deep learning. The efficiency of deep learning-based algorithms on a range of complex computer vision and image classification tasks is almost on par with human ability. Therefore, deep learning algorithms are frequently used in healthcare imaging for various purposes, such as disease diagnosis. 5 Yet, deep architectures need many training examples to learn useful representations. Creating extensive medical image datasets for supervised learning is more difficult than in other applications. Acquisition and labeling are costly and time-consuming, requiring specialized equipment and trained medical personnel. One of the main problems with contemporary deep learning and computer vision systems is data. Data about skin conditions are insufficient because many skin lesions with distinct characteristics exist. Because of these distinct characteristics, we propose a novel classification and prediction technique called the ViT-GradCAM model. We employ the vision transformer (ViT) technique to extract information from the images by converting them into a number of patches.

4. A gradient-weighted backpropagation process is carried out to avoid the loss of spatial information at the final layer.
5. To create a powerful, intuitive online tool that can identify and categorize skin lesions based on data collected in real time, helping medical professionals to obtain an initial diagnosis.
The remainder of this paper is organized as follows: Section 2 presents a literature survey of some benchmark skin disease classification techniques. Section 3 provides the methodology and mathematical model of the proposed ViT-GradCAM architecture. Section 4 demonstrates the proposed model's proficiency through experiments; the results show that the proposed model outperforms other conventional classification algorithms. Section 5 provides the conclusion of this framework.

11
Goseri et al. (2020) presented image augmentation techniques based on deep learning, which address class imbalance and image scarcity when training deep learning models. The overfitting issue is also discussed, and a solution for resolving it is offered. 12 Ayas (2022) presented Swin transformer model-based classification of skin diseases, focusing on efficiently learning spatial data from the training images to enhance classification accuracy.

FIGURE 1 Overview of ViT-GradCAM.
FIGURE 2 Different skin lesion images in the HAM 10000 dataset.
The HAM 10000 dataset is selected to train and validate the proposed model. Because of class imbalance, data preprocessing and data augmentation are applied. As a result, the number of samples in each class is increased, and the resulting sample set is sufficient to train the ViT-GradCAM model. After preprocessing and augmentation, the samples are fed to the transformer encoder, which generates the patch embeddings and concatenates them with the tokens to enhance classification and prediction accuracy. At the final layer, the GradCAM technique is applied to extract the spatial features for visualization using the gradient weights generated by backpropagation in the Grad-CAM model. As a result, the spreading rate is evaluated based on the intensity of the patch and the actual class.
of the diverse datasets, has been chosen for training the proposed ViT-GradCAM (https://www.kaggle.com/datasets/surajghuwalewala/ham1000-segmentation-and-classification). The dataset has 10 015 images of skin lesions covering seven different types of skin disease, namely AKIEC, BCC, BKL, DF, MEL, NV, and VASC, with 327, 514, 1099, 115, 1113, 6705, and 142 images in each class, respectively. 20 Some sample images from the HAM 10000 dataset are shown in Figure 2. The HAM 10000 dataset is highly imbalanced across classes. To overcome this imbalance, data augmentation is applied in this research to generate synthetic training images for better training of the proposed model. As a result, a large number of images is generated in each class, which helps enhance classification accuracy.
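The degree of imbalance can be quantified directly from the class counts given above. The sketch below computes the ratio between the largest and smallest class and the number of synthetic images each class would need in order to match the majority class. Balancing every class up to the majority count is an assumption for illustration; the target class size used in the study is not stated, and the function names are hypothetical.

```python
# Class counts as reported for the HAM 10000 dataset in this paper.
HAM10000_COUNTS = {
    "AKIEC": 327, "BCC": 514, "BKL": 1099, "DF": 115,
    "MEL": 1113, "NV": 6705, "VASC": 142,
}


def imbalance_ratio(counts):
    """Ratio of the largest class size to the smallest class size."""
    return max(counts.values()) / min(counts.values())


def augmentation_plan(counts, target=None):
    """Synthetic images needed per class to reach `target` (default: majority size)."""
    if target is None:
        target = max(counts.values())
    return {name: max(0, target - n) for name, n in counts.items()}


print(sum(HAM10000_COUNTS.values()))         # 10015
print(round(imbalance_ratio(HAM10000_COUNTS), 1))
print(augmentation_plan(HAM10000_COUNTS))
```

The NV class is roughly 58 times larger than the DF class, which is why augmentation is concentrated on the minority classes.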
10000 dataset images, which are preprocessed and augmented. It has been demonstrated that ViT can produce state-of-the-art results on several image recognition benchmarks, including ImageNet. ViT models are proficient in various image recognition tasks, such as scene and object identification. ViT performs exceptionally well and has a few advantages over other deep learning algorithms for classifying images. For instance, these algorithms need no manual resizing or cropping when handling input images of any size, and they can be trained with comparatively small quantities of data; these qualities make them a viable option for image detection and classification in practical applications. The ViT network uses a self-attention mechanism to capture long-range dependencies in the dataset image, which helps to enhance both the spreading rate prediction accuracy and the skin cancer classification accuracy. Instead of the convolution layers used in DL models, ViT uses a transformer encoder layer. The classification is performed using the transformer encoder with multi-head attention blocks to enhance accurate feature extraction in the target region. Unlike traditional transformers, the ViT architecture takes an input image that is divided into N patches and fed in as a sequence of linear embeddings of the segregated patches.

Figure 5B. The loss decreases as the number of epochs increases, and it diminishes according to the learning rate. Here, the loss percentage is lower for the proposed ViT-GradCAM model than for the other two models. The performance of ViT-GradCAM is validated by comparing the testing and training loss, as shown in Figure 6A, and the testing and training accuracy, as shown in Figure 6B. Figure 6A shows that the loss decreases as the number of epochs increases, while Figure 6B shows that the prediction accuracy improves as the number of epochs increases. As a result, the prediction of the spreading rate for each sample is improved. The categorization approach faces many difficulties because images in the seven classes appear nearly identical. Even though skin lesion prediction and classification have been the subject of multiple published studies, the suggested model proved more stable and precise. Understanding how DL techniques behave during training requires learning curves that represent the training and testing processes. The graphs we create then allow us to

5%, recall of about 95.5%, and F1 score of about 94.6%. The YoTransViT model achieved the second-highest accuracy, and the CNN model achieved the lowest accuracy because of the loss of essential spatial information in the fully connected layer, as shown in Figure 10. The overall comparison outcomes are listed in Table 4. In the proposed model, Grad-CAM effectively retains spatial information and back-propagates it to the classification layer, so the appropriate class is classified accurately.
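For reference, the evaluation metrics reported above (accuracy, precision, recall, and F1 score) can all be derived from a confusion matrix. The sketch below uses the usual definitions with rows as true classes and columns as predicted classes; the function name and the small two-class matrix are illustrative assumptions and do not reproduce the paper's results.

```python
def classification_metrics(cm):
    """Accuracy plus per-class (precision, recall, F1) from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, truly another class
        fn = sum(cm[i]) - tp                       # truly i, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class


accuracy, per_class = classification_metrics([[8, 2], [1, 9]])
print(accuracy, per_class)
```

For a seven-class problem such as this one, the per-class values are typically averaged (for example, as a macro average) to give a single precision, recall, and F1 figure.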
Comparative analysis of existing skin lesion classification techniques.

TABLE 2 Different classes with the number of images in the HAM 10000 dataset.
Abbreviations: AKIEC, Actinic Keratoses and intraepithelial carcinoma; BCC, Basal Cell Carcinoma; BKL, benign Keratosis lesions; DF, dermatofibroma; MEL, melanoma; NV, melanocytic nevi; VASC, vascular lesions.
the image quality by eliminating noise added to the training data. Because it removes the noise, the detection accuracy of ML models is improved. Generally, preprocessing facilitates the effective extraction of target features from images and helps address problems such as class imbalance in the dataset, resulting in more reliable and accurate predictions. The HAM 10000 dataset consists of 10 015 images, which differ in size. To achieve consistency, all the images are resized to 224 × 224 pixels without loss in the disease-affected regions.